Based on an analysis of 29,786 sales records from 136 Sydney suburbs, we conclude that distance from a train station has at most a small, practically insignificant effect on townhouse prices. New home buyers can therefore choose more conveniently located homes without expecting a significant increase in price.
We first generate a full list of the names, longitudes and
latitudes of the train stations in the respective suburbs. The list is
stored in main1_INPUT.txt (Appendix 1 ??). The
R code (Appendix 2 ??) then reads this .txt file and outputs a
.csv file for each suburb (Appendix 3 ?? Or github link).
Finally, we read in the data with the code below.
# Get the list of CSV files in the 'csv_cache' directory
# (list.files() takes a regular expression, not a glob)
csv_files <- list.files(path = "csv_cache", pattern = "\\.csv$", full.names = TRUE)
# Initialize an empty data frame to store the combined data
combined_df <- data.frame()
# Loop through each file in the csv_files list
for (file in csv_files) {
  # Read the CSV file
  location_data <- read.csv(file)
  # Convert distance from metres to kilometres
  location_data$"distance_to_train_station(km)" <- location_data$distance_to_train_station / 1000
  # Class distance into 250 m intervals from 0 to 4 km
  location_data$distance_class <- cut(location_data$"distance_to_train_station(km)",
                                      breaks = seq(0, 4, by = 0.25))
  # Combine the processed data frame with the combined_df data frame
  combined_df <- rbind(combined_df, location_data)
}
# Inspect the combined number of suburbs
print(paste0("Total number of suburbs: ", length(csv_files)))
## [1] "Total number of suburbs: 136"
# Inspect the combined data frame
tail(combined_df)
## House_ID address bedroom bathroom
## 29781 810 7/24 Methven Street Mount Druitt 2770 3 1
## 29782 811 9/41 Methven Street Mount Druitt 2770 3 1
## 29783 812 1/21 Hythe Street Mount Druitt 2770 3 1
## 29784 813 1/14 Meacher Street Mount Druitt 2770 3 1
## 29785 814 4/34 Durham Street Mount Druitt 2770 3 1
## 29786 815 49/334 Woodstock Avenue Mount Druitt 2770 3 NA
## carspace soldprice yearsold latitude longitude
## 29781 1 154000 2001-09-01 -33.76247 150.8216
## 29782 1 156000 2001-09-01 -33.76186 150.8249
## 29783 1 400000 2001-08-01 -33.76308 150.8211
## 29784 1 189000 2001-08-01 -33.76042 150.8203
## 29785 1 178000 2001-07-01 -33.77166 150.8122
## 29786 NA 124000 2001-06-01 -33.75742 150.8206
## distance_to_train_station distance_to_train_station(km) distance_class
## 29781 802.9798 0.8029798 (0.75,1]
## 29782 967.3295 0.9673295 (0.75,1]
## 29783 729.2326 0.7292326 (0.5,0.75]
## 29784 1019.4336 1.0194336 (1,1.25]
## 29785 768.9916 0.7689916 (0.75,1]
## 29786 1354.9054 1.3549054 (1.25,1.5]
Data used for the report was scraped from the internet using the following link: https://www.auhouseprices.com/sold/list/NSW/.
In total, we analysed 136 suburbs across Sydney, containing a total of 29786 data entries. Each data entry contains a complete buy/sell history.
We used these variables and cleaned the data in the following ways:
A function was created to calculate the straight-line distance from each townhouse to a train station, which only approximates the actual travel distance between the two. Some townhouses are likely closer to a station in a neighbouring suburb. The relevance of trains as a mode of transport may also differ between suburbs. Additionally, train stations often coincide with commercial centres, which may affect selling price.
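The distance function itself is not reproduced in this report. As a minimal sketch of how a straight-line distance can be computed from two latitude/longitude pairs, the haversine formula is one common choice. The function name `haversine_m`, the Earth radius value and the station coordinates below are assumptions for illustration, not the function used to build the data set.

```r
# Hypothetical helper (not the report's own function):
# great-circle distance in metres between two points given in decimal degrees.
haversine_m <- function(lat1, lon1, lat2, lon2, radius_m = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_m * asin(pmin(1, sqrt(a)))
}

# Townhouse coordinates are taken from the data frame output above;
# the station coordinates are illustrative placeholders only.
d <- haversine_m(-33.76247, 150.8216, -33.7690, 150.8200)
print(d)
```

Because the arithmetic is vectorised, the same call works on whole latitude and longitude columns of a data frame.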
A significant assumption was that amenities close to train stations (e.g. shops, schools) do not themselves increase the price of townhouses; in practice these may be confounding variables. Another assumption was that all stations, whether major or minor, had an equal effect on selling prices.
What is the effect of distance to train stations on Sydney’s housing prices?
Distances from stations were classed into 250 metre intervals to increase the readability of graphical summaries, as the individual data points produced cluttered scatterplots. A side-by-side boxplot was used to assess whether selling price changes with distance. The boxplot suggests there is no correlation between proximity to train stations and selling price. The residual plot shows data points clustering in the bottom-left. Because the residuals are not randomly scattered, the data is not homoscedastic, hence a linear model is not appropriate.
The numerical summary also suggested no correlation. The median selling price for townhouses between 0 and 250 metres from a station was $506,000; it increased to $560,000 between 1.75 and 2 kilometres, then decreased to $360,000 between 3.75 and 4 kilometres. This fluctuation in median selling price over distance discounts the possibility of a linear correlation. Properties in Sydney within 400 metres of train stations have higher price growth (4.5%) than properties between 800 and 1600 metres away (0.3%) (Forbes, 2021). Other research suggests that proximity to train stations has an insignificant correlation with property prices (r = 0.091, p = 0.380) (Berawi et al., 2020). The same research suggests that number of rooms and building size were the most significant contributors to property pricing close to stations (Berawi et al., 2020). From our graphs, an increase in car spaces and bathrooms was also linked to an increase in price, so these could potentially be confounding variables.
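A per-class median like the ones quoted here can be computed in one step with `aggregate()`. The sketch below uses a small made-up data frame (the `toy` values are not from the scraped data) with the same 250 metre classing, so it only illustrates the mechanics.

```r
# Toy data: made-up distances (km) and prices, for illustration only
toy <- data.frame(
  distance_km = c(0.10, 0.20, 0.30, 0.45, 0.60, 0.70),
  soldprice   = c(500000, 520000, 480000, 610000, 450000, 470000)
)

# Class distances into 250 m intervals, as in the report
toy$distance_class <- cut(toy$distance_km, breaks = seq(0, 4, by = 0.25))

# Median selling price within each distance class
medians <- aggregate(soldprice ~ distance_class, data = toy, FUN = median)
print(medians)
```

Only classes that actually contain observations appear in the result, so empty distance classes are dropped rather than reported as missing.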
The number of confounding variables, alongside a possibly more complex trend, could account for the lack of correlation observed. Prices seemed to increase with the number of bedrooms, car spaces and bathrooms, yet after controlling for them there was still no correlation. This suggests there are further confounding variables unaccounted for. To account for inflation, a boxplot of selling price by year between 2000 and 2023 in Western Sydney suburbs was plotted. There was a general increase in townhouse price over the years, so inflation is also a significant confounding variable with a substantial effect on selling price. The complex interaction of variables that affect property price could explain the absence of a correlation.
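In this report, controlling for bedrooms and car spaces was done by stratifying the boxplots. A different way to adjust for several variables at once is a multiple regression. The sketch below is an assumption-laden illustration on synthetic data (the prices are generated so that they depend on bedrooms and year but not on distance); it is not a model fitted to the scraped data.

```r
set.seed(1)
n <- 200

# Synthetic covariates, for illustration only
toy <- data.frame(
  distance_km = runif(n, 0, 4),
  bedroom     = sample(1:5, n, replace = TRUE),
  year        = sample(2000:2023, n, replace = TRUE)
)

# Synthetic prices: driven by bedrooms and year, with no distance effect
toy$soldprice <- 200000 + 80000 * toy$bedroom +
  15000 * (toy$year - 2000) + rnorm(n, sd = 50000)

# The distance coefficient is estimated while holding bedroom and year fixed
fit <- lm(soldprice ~ distance_km + bedroom + year, data = toy)
print(coef(summary(fit)))
```

In a fit like this, the `distance_km` row gives the estimated price change per kilometre after adjusting for the other terms, which is the quantity the stratified boxplots approximate visually.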
Berawi, M. A., Miraj, P., Saroji, G., & Sari, M. (2020). Impact of rail transit station proximity to commercial property prices: Utilizing big data in Urban Real Estate. Journal of Big Data, 7(1), 1–17. https://doi.org/10.1186/s40537-020-00348-z
Bowes, D. R., & Ihlanfeldt, K. R. (2001). Identifying the impacts of rail transit stations on residential property values. Journal of Urban Economics, 50(1), 1–25. https://doi.org/10.1006/juec.2001.2214
Forbes, K. (2021, August 12). Does a train station increase the value of a property? Metropole Property Strategists. Retrieved April 10, 2023, from https://metropole.com.au/how-have-train-stations-affected-property-prices-in-sydney/#:~:text=It%20found%20that%20properties%20within,a%20growth%20rate%20of%200.3%25.
When did your team meet (date and time), and what did each team member contribute?
??
library(dplyr)    # filter()
library(ggplot2)  # ggplot()
combined_df_1bed <- filter(combined_df, bedroom == 1)
combined_df_2bed <- filter(combined_df, bedroom == 2)
combined_df_3bed <- filter(combined_df, bedroom == 3)
combined_df_4bed <- filter(combined_df, bedroom == 4)
combined_df_5bed <- filter(combined_df, bedroom == 5)
ggplot(combined_df_1bed, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 1 Bedroom", x="Distance from Train Station(km)", y="Selling Price (x$100000)", fill = "Number of Carspaces")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
ggplot(combined_df_1bed, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5, aes(fill=factor(carspace))) +
labs(title = "Sold Price vs Distance from Train Station for 1 Bedroom", x="Distance from Train Station(km)", y="Selling Price (x$100000)", fill = "Number of Carspaces")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
ggplot(combined_df_2bed, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 2 Bedrooms", x="Distance from Train Station(km)", y="Selling Price (x$100000)", fill = "Number of Carspaces")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
summary(combined_df_2bed$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000e+04 3.500e+05 4.680e+05 7.890e+05 6.120e+05 2.147e+09
ggplot(combined_df_2bed, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5, aes(fill=factor(carspace))) +
labs(title = "Sold Price vs Distance from Train Station for 2 Bedrooms", x="Distance from Train Station(km)", y="Selling Price (x$100000)", fill = "Number of Carspaces")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
summary(combined_df_2bed$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000e+04 3.500e+05 4.680e+05 7.890e+05 6.120e+05 2.147e+09
ggplot(combined_df_3bed, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 3 Bedrooms", x="Distance from Train Station(km)", y="Selling Price (x$100000)", fill = "Number of Carspaces")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
summary(combined_df_3bed$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 575 425000 570000 642811 745000 22867454
ggplot(combined_df_3bed, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5, aes(fill=factor(carspace))) +
labs(title = "Sold Price vs Distance from Train Station for 3 Bedrooms", x="Distance from Train Station(km)", y="Selling Price (x$100000)", fill = "Number of Carspaces")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
summary(combined_df_3bed$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 575 425000 570000 642811 745000 22867454
ggplot(combined_df_4bed, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 4 Bedrooms", x="Distance from Train Station(km)", y="Selling Price (x$100000)", fill = "Number of Carspaces")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
summary(combined_df_4bed$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1650 534300 675000 763610 866000 15000000
ggplot(combined_df_4bed, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5, aes(fill=factor(carspace))) +
labs(title = "Sold Price vs Distance from Train Station for 4 Bedrooms", x="Distance from Train Station(km)", y="Selling Price (x$100000)", fill = "Number of Carspaces")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
summary(combined_df_4bed$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1650 534300 675000 763610 866000 15000000
ggplot(combined_df_5bed, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 5 Bedrooms", x="Distance from Train Station(km)", y="Selling Price (x$100000)", fill = "Number of Carspaces")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
summary(combined_df_5bed$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 200000 659990 783000 864065 930000 3080000
ggplot(combined_df_5bed, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5, aes(fill=factor(carspace))) +
labs(title = "Sold Price vs Distance from Train Station for 5 Bedrooms", x="Distance from Train Station(km)", y="Selling Price (x$100000)", fill = "Number of Carspaces")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
summary(combined_df_5bed$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 200000 659990 783000 864065 930000 3080000
combined_df_1bed_1car <-filter(combined_df, bedroom ==1, carspace == 1)
combined_df_2bed_1car <-filter(combined_df, bedroom ==2, carspace == 1)
combined_df_2bed_2car <-filter(combined_df, bedroom ==2, carspace == 2)
combined_df_3bed_1car <-filter(combined_df, bedroom ==3, carspace == 1)
combined_df_3bed_2car <-filter(combined_df, bedroom ==3, carspace == 2)
combined_df_3bed_3car <-filter(combined_df, bedroom ==3, carspace == 3)
combined_df_3bed_4car <-filter(combined_df, bedroom ==3, carspace == 4)
combined_df_4bed_1car <-filter(combined_df, bedroom ==4, carspace == 1)
combined_df_4bed_2car <-filter(combined_df, bedroom ==4, carspace == 2)
combined_df_4bed_3car <-filter(combined_df, bedroom ==4, carspace == 3)
combined_df_4bed_4car <-filter(combined_df, bedroom ==4, carspace == 4)
combined_df_5bed_1car <-filter(combined_df, bedroom ==5, carspace == 1)
combined_df_5bed_2car <-filter(combined_df, bedroom ==5, carspace == 2)
combined_df_5bed_3car <-filter(combined_df, bedroom ==5, carspace == 3)
ggplot(combined_df_1bed_1car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 1 Bedroom and 1 Carspace", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_1bed_1car$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61325 253000 375000 402900 487225 1330000
ggplot(combined_df_2bed_1car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 2 Bedrooms and 1 Carspace", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_2bed_1car$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.700e+04 3.400e+05 4.600e+05 8.584e+05 6.000e+05 2.147e+09
ggplot(combined_df_2bed_2car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 2 Bedrooms and 2 Carspaces", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_2bed_2car$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 66000 432625 547500 588482 700000 1581000
ggplot(combined_df_3bed_1car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 3 Bedrooms and 1 Carspace", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_3bed_1car$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50000 360000 500000 549951 655000 5850000
ggplot(combined_df_3bed_2car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 3 Bedrooms and 2 Carspaces", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_3bed_2car$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 575 500000 637000 723621 840000 22867454
ggplot(combined_df_3bed_3car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 3 Bedrooms and 3 Carspaces", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_3bed_3car$soldprice)
ggplot(combined_df_3bed_4car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 3 Bedrooms and 4 Carspaces", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_3bed_4car$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 305000 478500 590000 668278 809000 1600000
ggplot(combined_df_4bed_1car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 4 Bedrooms and 1 Carspace", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_4bed_1car$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 175000 450000 620000 661696 770000 2750000
ggplot(combined_df_4bed_2car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 4 Bedrooms and 2 Carspaces", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_4bed_2car$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1650 555000 690000 790771 890000 15000000
ggplot(combined_df_4bed_3car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 4 Bedrooms and 3 Carspaces", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_4bed_3car$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100000 615250 769000 927646 1023250 3000000
ggplot(combined_df_4bed_4car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 4 Bedrooms and 4 Carspaces", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_4bed_4car$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 330000 535500 655000 715113 840000 1630000
ggplot(combined_df_5bed_1car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 5 Bedrooms and 1 Carspace", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_5bed_1car$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 335000 602500 722500 743607 870000 1850000
ggplot(combined_df_5bed_2car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 5 Bedrooms and 2 Carspaces", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_5bed_2car$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 200000 663748 820000 900051 960416 3080000
ggplot(combined_df_5bed_3car, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station for 5 Bedrooms and 3 Carspaces", x="Distance from Train Station(km)", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_5bed_3car$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 700000 738000 790000 787833 838250 910000
combined_df$Year <- as.factor(format(as.Date(combined_df$yearsold), "%Y"))
# Filtering by distance class
combined_df_0.00 <-filter(combined_df, distance_class == "(0,0.25]")
combined_df_0.25 <-filter(combined_df, distance_class == "(0.25,0.5]")
combined_df_0.50 <-filter(combined_df, distance_class == "(0.5,0.75]")
combined_df_0.75 <-filter(combined_df, distance_class == "(0.75,1]")
combined_df_1.00 <-filter(combined_df, distance_class == "(1,1.25]")
combined_df_1.25 <-filter(combined_df, distance_class == "(1.25,1.5]")
combined_df_1.50 <-filter(combined_df, distance_class == "(1.5,1.75]")
combined_df_1.75 <-filter(combined_df, distance_class == "(1.75,2]")
combined_df_2.00 <-filter(combined_df, distance_class == "(2,2.25]")
combined_df_2.25 <-filter(combined_df, distance_class == "(2.25,2.5]")
combined_df_2.50 <-filter(combined_df, distance_class == "(2.5,2.75]")
combined_df_2.75 <-filter(combined_df, distance_class == "(2.75,3]")
combined_df_3.00 <-filter(combined_df, distance_class == "(3,3.25]")
combined_df_3.25 <-filter(combined_df, distance_class == "(3.25,3.5]")
combined_df_3.50 <-filter(combined_df, distance_class == "(3.5,3.75]")
combined_df_3.75 <-filter(combined_df, distance_class == "(3.75,4]")
ggplot(combined_df_0.00, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 0 to 0.25km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_0.00$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50000 415000 565000 655697 780000 3300000
ggplot(combined_df_0.25, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 0.25 to 0.50km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_0.25$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000e+04 4.500e+05 5.975e+05 1.106e+06 7.910e+05 2.147e+09
ggplot(combined_df_0.50, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 0.50 to 0.75km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_0.50$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1650 425000 582500 652451 775000 15000000
ggplot(combined_df_0.75, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 0.75 to 1.00km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_0.75$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60000 420000 549475 612502 718375 6203000
ggplot(combined_df_1.00, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 1.00 to 1.25km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_1.00$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 575 392500 540000 606991 720000 4400000
ggplot(combined_df_1.25, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 1.25 to 1.50km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_1.25$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61325 378000 524000 572350 680000 4840000
ggplot(combined_df_1.50, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 1.50 to 1.75km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_1.50$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 92000 374462 522250 563676 650000 2812000
ggplot(combined_df_1.75, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 1.75 to 2.00km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_1.75$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100000 359250 509500 566946 680000 3100000
ggplot(combined_df_2.00, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 2.00 to 2.25km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_2.00$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 132500 370000 490275 529213 620000 2430000
ggplot(combined_df_2.25, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 2.25 to 2.50km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_2.25$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 190000 382500 550000 592289 687000 5346000
ggplot(combined_df_2.50, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 2.50 to 2.75km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_2.50$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##
ggplot(combined_df_2.75, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 2.75 to 3.00km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_2.75$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##
ggplot(combined_df_3.00, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 3.00 to 3.25km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_3.00$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 180000 357500 500101 538112 600000 1777000
ggplot(combined_df_3.25, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 3.25 to 3.50km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_3.25$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 225000 300000 460000 446274 543000 1125000
ggplot(combined_df_3.50, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 3.50 to 3.75km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_3.50$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250000 300000 355000 386685 441000 664000
ggplot(combined_df_3.75, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs year for townhouses 3.75 to 4.00km from train station", x="Year", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))
summary(combined_df_3.75$soldprice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
##
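The per-class subsets and summaries above repeat the same call once per distance class. As a compact alternative, base R's `split()` and `lapply()` can produce all the summaries in one pass. The sketch below uses a small made-up data frame (`toy` is hypothetical), not the report's data.

```r
# Toy data with two distance classes, for illustration only
class_levels <- c("(0,0.25]", "(0.25,0.5]")
toy <- data.frame(
  distance_class = factor(
    c("(0,0.25]", "(0,0.25]", "(0.25,0.5]", "(0.25,0.5]"),
    levels = class_levels
  ),
  soldprice = c(400000, 600000, 300000, 500000)
)

# One vector of prices per distance class
by_class <- split(toy$soldprice, toy$distance_class)

# Six-number summary for every class at once
class_summaries <- lapply(by_class, summary)
print(class_summaries)
```

The resulting list is named by class label, so an individual summary can be looked up with, for example, `class_summaries[["(0,0.25]"]]`.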
ggplot(combined_df, aes(x = Year, y = soldprice/100000))+
geom_point(aes(color=distance_class)) +
labs(title = "Sold Price over Years", x="Year", y="Selling Price (x$100000)", color = "Distance class (km)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
ggplot(combined_df, aes(x = Year, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price over Years", x="Year", y="Selling Price (x$100000)", fill = "Number of Carspaces")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
ggplot(combined_df, aes(x = factor(bedroom), y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price for Different Numbers of Bedrooms", x="Number of Bedrooms", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
ggplot(combined_df, aes(x = factor(bathroom), y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price for Different Numbers of Bathrooms", x="Number of Bathrooms", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
ggplot(combined_df, aes(x = factor(carspace), y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price for Different Numbers of Carspaces", x="Number of Carspaces", y="Selling Price (x$100000)")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
# Tukey fences on sold price: keep values within 1.5 * IQR of the quartiles
q1 <- quantile(combined_df$soldprice, 0.25)
q3 <- quantile(combined_df$soldprice, 0.75)
iqr <- q3 - q1
combined <- subset(combined_df, soldprice >= q1 - 1.5*iqr & soldprice <= q3 + 1.5*iqr)
# Same fences on distance; na.rm = TRUE ignores missing distances
dist_q1 <- quantile(combined_df$`distance_to_train_station(km)`, 0.25, na.rm = TRUE)
dist_q3 <- quantile(combined_df$`distance_to_train_station(km)`, 0.75, na.rm = TRUE)
dist_iqr <- dist_q3 - dist_q1
# Note: this subsets combined_df again, so it replaces (rather than refines) the price-filtered data above
combined <- subset(combined_df, `distance_to_train_station(km)` >= dist_q1 - 1.5*dist_iqr & `distance_to_train_station(km)` <= dist_q3 + 1.5*dist_iqr)
ggplot(combined, aes(x = distance_class, y = soldprice/100000))+
geom_boxplot(outlier.colour = "blue", outlier.size=1.5) +
labs(title = "Sold Price vs Distance from Train Station", x="Distance from Train Station(km)", y="Selling Price (x$100000)", fill = "Number of Carspaces")+
theme_bw()+
theme(axis.text.x = element_text(angle=45,hjust=1))+
theme(plot.title = element_text(hjust=0.25))
model <- lm(soldprice ~ `distance_to_train_station(km)`, data = combined)
plot(combined$"distance_to_train_station(km)", resid(model), main = "Residual Plot", xlab = "Distance to train station (km)", ylab = "Residuals", cex=0.15)
abline(h=0)